Abstract

Data  on  short-term  user  file  reference  patterns  have  been  collected  from  a  local  UNIX  system  supporting 
university  research.  These  data  provide  detailed  information  on  file  opens  and  directory  accesses.  Somewhat 
coarser  information  on  internal  file  operations  has  also  been  collected. 

An  analysis  of  the  data  shows  that  referenced  files  in  our  environment  are  generally  small  (under  KXX) 
bytes), usually completely read or written, and have short interopen intervals (a succeeding open to a file
usually occurs within 60 seconds of the last open). In addition, while there is extensive sharing of some
files, this sharing is restricted to standard system files. We see very little sharing of user files.

This  paper  emphasizes  the  implications  the  patterns  we  observe  have  for  distributed  file  systems.  However, 
the results may also be applied to the design and modeling of more traditional file systems.


This work was supported in part by the National Science Foundation under grant number DCR-8320136
and in part by the Office of Naval Research under grant number N00014-82-K-0193.


1.  Introduction 


Much  work  has  been  done  in  recent  years  on  transparent  distributed  file  systems  (DFS’s)  for  local  area 
networks [Ellis 83, Satyanarayanan 85, Tichy 84, Walker 83]. These DFS's typically provide transparent
name  lookup,  transparent  file  access,  and  facilities  for  automatic  caching  and  migration  of  files. 
Understanding  and  improving  the  behavior  of  such  DFS's  has  been  hampered  by  a  lack  of  information  on 
the  ways  that  they  are  used.  In  particular,  there  is  very  little  data  available  on  short-term  file  and  directory 
usage  patterns. 

Inspired and frustrated by this lack of information, we instrumented a local UNIX system to collect
information on file system requests. The UNIX file system is particularly appropriate for a study such as
this  one  because  it  places  relatively  few  constraints  on  user  behavior.  In  addition,  all  of  the  DFS's 
mentioned  above  used  the  UNIX  file  system  as  their  design  model.  We  logged  directory  accesses,  file 
opens,  file  closes,  and  information  on  process  creation  and  destruction.  In  addition,  the  amount  of 
information  read  and  written  for  opened  files  was  logged. 

This  paper  describes  the  data  collection  method,  presents  our  analysis  of  short  term  file  reference  patterns 
and  discusses  how  the  results  may  be  applied  to  the  design  and  tuning  of  DFS's.  A  companion  paper 
[Floyd 86b] presents an analysis of short term directory reference patterns. We have generally tried to
present results in a way that gives a qualitative feel for the characteristics of the data we have measured.
Quantitative fits and distributions are, for the most part, sacrificed in favor of observations that would aid
in developing and operating DFS's. These results, along with a simulation driven by the data we have
collected, will be used to investigate the performance of the Roe distributed file system [Floyd 86a].

Our  work  is  novel  in  several  respects.  It  is  by  far  the  most  detailed  study  of  short  term  UNIX  file  reference 
patterns  that  has  been  done  to  date.  It  is  also  the  only  study  we  have  seen  that  examines  the  differences 
between  important  user  and  file  classes.  In  addition  to  examining  the  overall  request  behavior,  we  have 
broken  down  references  by  the  type  of  file  (temporary,  log  and  permanent),  owner  of  file  (system,  user  and 
net), and requester (system, user and net). We see large differences in behavior for the various classes.
Knowledge  of  these  differences  should  be  useful  in  designing  future  DFS's. 

Section 2 of this paper surveys previous work in the area. Section 3 describes the environment in which
our measurements were made. Sections 4 and 5 present an overview of the data collection and analysis
methods. In section 6, we present some of the results of the analysis. The implications the results of this
analysis have for DFS design are discussed in section 7. Section 8 describes further analysis and data
collection that could be done and section 9 summarizes our results. Finally, in the appendices, we present
results  that  are  too  detailed  for  the  main  body  of  the  paper. 

Familiarity  with  UNIX  [Ritchie  78]  is  assumed.  Knowledge  of  4.2BSD  UNIX  [Joy  83]  may  also  be  useful. 


UNIX is a trademark of AT&T Bell Laboratories.
2. Previous Work

Early studies of file reference patterns [Satyanarayanan 81, Smith 81, Stritter 77] concentrated on long term
(days or weeks) reference patterns that could be used in designing archival migration policies. The relatively
small time delay of a local area network makes migration on a much smaller time scale feasible.

Recent  work  has  concentrated  on  short  term  (seconds)  file  reference  patterns.  Porcar  studied  what  was 
primarily batch activity on IBM mainframe systems [Porcar 82]. Our study is of an interactive UNIX
system, an environment that has considerably different reference patterns. The differences between Porcar's
results and ours are discussed in section 6.1. Satyanarayanan has made measurements on an interactive
DEC-10 system [Satyanarayanan 83], with events being treated anonymously. Unfortunately, anonymous
references make intelligent migration difficult. Smith has also done a study of short term file reference
patterns [Smith 85], but primarily at the disk track level. We are concerned here with logical operations
(opens,  closes  and  so  on)  at  a  higher  level  of  the  system  and  less  concerned  with  "tuning"  using  block 
caches  and  other  such  methods. 

Work done on 4.2BSD UNIX file reference patterns at Berkeley is more closely related to our work. The file
system tracing described by Zhou et al. [Zhou 85] is similar in approach to ours. The differences between
the two packages are described in section 4. Ousterhout et al. report on read/write characteristics, file sizes,
and  data  lifetimes  for  several  4.2BSD  UNIX  systems  [Ousterhout  85].  Their  results  in  these  areas  are 
similar  to  ours  and  are  discussed  in  more  detail  in  section  6.1. 

None  of  these  previous  studies  have  collected  information  on  directory  access  patterns.  This  information  is 
not  needed  in  systems  that  are  concerned  primarily  with  migration  to  manage  disk  storage,  since  files  are 
typically  much  larger  than  the  directories  that  reference  them.  However,  a  DFS  may  also  migrate  and 
replicate  directories  to  improve  performance  and  availability.  In  a  DFS  with  non-trivial  directory 
structures,  the  overhead  of  directory  access  is  an  important  performance  consideration.  Evaluating 
directory  design  decisions  in  the  absence  of  data  on  reference  patterns  is  difficult.  Section  4  of  this  paper 
describes  the  directory  data  collected  by  our  tracing  package. 

Previous  studies  have  also,  for  the  most  part,  ignored  the  distinctions  between  batch  and  interactive  use, 
system  and  user  files,  log  and  permanent  files  and  so  on.  We  believe  that  information  on  the  behavior  of 
each  of  these  file  classes  can  be  of  great  value  in  designing  a  DFS  and  have  considered  them  separately 
when  clear  differences  exist. 

3.  Data  Collection  Environment 

The data used in this paper were collected from a VAX 11/780 on the University of Rochester Computer
Science Department Internet. At the time that the data were collected (September 1985), the internet
consisted of a VAX 11/780, 4 VAX 11/750's, 7 Sun workstations, 13 Xerox Dandelion workstations, 3
Symbolics LISP machines and a number of special purpose devices. The 11/780, Seneca, was selected as
the primary machine for data collection because it was far and away the most heavily used of our systems.
Seneca had, at the time, 4MB of memory, 560MB of disk storage and was running 4.2BSD UNIX. The
system supported roughly 200 users. The primary user activities were program development (as part of our
research effort), text editing and formatting, reading news and reading personal mail. Seneca also acted as a
USENET news and UUCP mail relay [Nowitz 78]. There was relatively little database activity.

VAX is a trademark of Digital Equipment Corporation.

Our local VAXen are named after Western New York State's Finger Lakes.

Data were also collected from two of the 11/750's. Preliminary analysis of the 11/750 data merely
confirmed  the  importance  of  Seneca  in  our  environment.  Neither  of  the  11/750’s  had  file  system  activity 
levels  greater  than  15%  of  that  seen  on  Seneca.  Because  of  this,  only  the  Seneca  data  were  fully  analyzed. 


4.  Data  Collection  Method 

Two  types  of  data  were  collected:  1)  a  static  "snapshot"  of  the  file  system  and  2)  a  running  log  of  file 
system  activity. 

4.1.  Static  Snapshot 

The  static  snapshot  provides  a  picture  of  the  entire  file  structure  on  a  machine  at  a  given  point  in  time.  The 
information  generated  for  each  file  system  object  that  we  are  interested  in  is  given  in  table  1.  Processing 
starts  at  the  root  of  the  file  system  hierarchy  and  recursively  traverses  the  directory  tree,  logging  each  object 
encountered. 

A static snapshot was taken of the Seneca file system when file system logging (section 4.2) was started. This
snapshot  was  used  as  a  starting  point  for  the  analysis  programs  (section  5)  and  also  provided  information  on 
the  static  file  size  distribution. 
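The traversal described above can be sketched in a few lines of Python. This is only an illustration, not the snapshot program used in the study: the function name `snapshot` and the tuple layout are assumptions, and the fields recorded for each object follow table 1 as printed.

```python
import os
import stat


def snapshot(path, out):
    """Append one record for `path` to `out`, recursing into directories.

    Hypothetical sketch: record layouts mirror table 1 as printed.
    """
    st = os.lstat(path)  # lstat, so symbolic links are logged, not followed
    if stat.S_ISDIR(st.st_mode):
        out.append(("directory", path, st.st_dev, st.st_mode))
        for name in sorted(os.listdir(path)):
            snapshot(os.path.join(path, name), out)
    elif stat.S_ISREG(st.st_mode):
        out.append(("regular file", path, st.st_dev, st.st_ino, st.st_size))
    elif stat.S_ISLNK(st.st_mode):
        out.append(("symbolic link", path, os.readlink(path)))
    else:
        out.append(("special file", path))
```

Calling `snapshot("/", records)` with an empty list would walk the whole tree from the root; unreadable directories are not handled in this sketch.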

4.2. Logging File System Activity

The  4.2BSD  UNIX  kernel  was  modified  to  log  selected  system  calls  made  by  users.  The  calls  logged  can  be 
classified  as  follows: 

(1) Directory structure modifications: mkdir, rename, rmdir and symlink.

(2) Process context: chdir, chroot, exit, fork/vfork and setreuid.

(3) Other references: close, execv/execve, link, open/creat, truncate, unlink.

The  logging  of  these  calls  has  a  negligible  effect  on  the  performance  of  the  host  (less  than  1%). 


object          output
directory       name, device, mode
regular file    name, device, inode, size (bytes)
symbolic link   name, target file
special file    name

Table 1: snapshot output


Table 2: dynamic log structure
relative references made by a process (those not starting from the root of the file system tree). Chroot
changes  the  root  of  the  file  system  as  seen  by  a  process.  Fork  (fork  and  vfork  system  calls)  and  exit  create 
and destroy processes. Logging these allows us to keep track of processes created for each user. Setreuid
changes  the  effective  "owner"  of  the  current  process.  This  is  the  mechanism  for  logging  into  the  system. 

The  remaining  records  (close,  execute,  link,  open,  truncate  and  unlink)  are  the  actual  references  to  files. 
Execute  (execv  and  execve  system  calls)  executes  a  file,  replacing  the  current  process  with  the  image  given 
in  the  file.  Link  and  unlink  add  and  delete  directory  entries  for  files.  If  unlink  removes  the  last  link  to  a 
file,  the  file  is  deleted.  Open  (open  and  creat  system  calls)  opens  or  creates  a  file  or  opens  a  directory. 
Processes  access  files  either  by  explicitly  opening  them  or  by  inheriting  open  files  from  their  parents. 
Truncate  shortens  a  file.  Close  records  indicate  that  a  process  no  longer  has  a  file  open.  They  are  generated 
by either a close system call or by a process exit. As mentioned earlier, a process may inherit open files
from its parent. If this happens, the close record is generated when the last process having access to a file
due to the open closes the file or exits (for those in the know: we log the release of the kernel open file
table entry). We only log closes for regular (data) files. Closes are not logged for directories or for special
files (files corresponding to devices). Since directories are short, completely scanned when opened and can
only  be  opened  for  reading,  close  records  for  directory  opens  would  have  given  us  little  useful  information. 
Special  files  are  not  analyzed  in  this  study  (except  for  a  count  of  opens). 
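The rule for when a close record is generated amounts to reference counting on the open file table entry. The following toy model is a sketch of that rule only, not the kernel modification itself; the class `OpenFileTable`, its method names and the record tuples are all hypothetical.

```python
class OpenFileTable:
    """Toy model of an open file table with per-entry reference counts."""

    def __init__(self):
        self.refs = {}    # entry id -> number of processes sharing the entry
        self.paths = {}   # entry id -> path given in the open call
        self.held = {}    # pid -> entry ids the process currently holds
        self.log = []     # records emitted so far
        self._next = 0

    def open(self, pid, path):
        eid = self._next
        self._next += 1
        self.refs[eid] = 1
        self.paths[eid] = path
        self.held.setdefault(pid, []).append(eid)
        self.log.append(("open", pid, path))
        return eid

    def fork(self, parent, child):
        # The child inherits every entry the parent holds.
        inherited = list(self.held.get(parent, []))
        for eid in inherited:
            self.refs[eid] += 1
        self.held[child] = inherited

    def close(self, pid, eid):
        self.held[pid].remove(eid)
        self._release(pid, eid)

    def exit(self, pid):
        # A process exit releases everything the process still holds.
        for eid in self.held.pop(pid, []):
            self._release(pid, eid)

    def _release(self, pid, eid):
        self.refs[eid] -= 1
        if self.refs[eid] == 0:
            # Only the release of the last reference produces a close record.
            self.log.append(("close", pid, self.paths.pop(eid)))
            del self.refs[eid]
```

In this model, a parent that opens a file, forks and then closes the file produces no close record; the record appears only when the child closes the file or exits, and it carries the child's process number.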

The  calls  listed  in  table  2  are  logged  for  all  processes  in  the  system.  In  addition,  a  small  number  of 
administrative records having to do with enabling and disabling data collection are logged. The most
important of these is the process state record. A process state record contains information on the ruid,
working directory, root directory and command name for a process. One of these is logged for a process the
first time it appears in a log, but only if we don't already have this information for the process. Process state
records are only necessary for processes that exist before logging is started (and for their children until we
log the parent). They give us a way to locate the process in the directory tree and to classify it as a user,
system  or  net  process. 

The  4.2BSD  tracing  package  we  have  described  differs  from  the  one  developed  independently  at  Berkeley 
by Zhou et al. [Zhou 85] in a number of ways. The most important difference is that we don't collect
information on internal file operations. This means that we have less information on the timing of these
operations to files and on which bytes are accessed. We do, however, log the number of bytes read from or
written to an opened file. As we will see later on (section 6.1), most files in our environment are read or
written completely and are usually open for only a short period of time. These results, combined with the
fact  that  most  DFS's  treat  files  as  a  whole,  mean  that  the  omission  of  internal  file  operations  is  not 
important  for  our  particular  application. 

We also collect less information per record. In particular, all of our times are real times at the finish of the
system call. Zhou et al. record, in addition to real times, the duration of the call and process virtual times.
We made a decision early on to collect the minimum information necessary for our purposes. This allows
us to collect and process data for a longer period, but means that our trace is sensitive to the capacity of the
machine that the data was collected on. Adjusting for this would be difficult in any case.

Finally, we collect information on high level directory operations (create, delete and open). This allows us
to track process locations in the directory tree so that we can accurately analyze relative file references. It
also gives us the data needed to analyze directory reference patterns.




The trace data collected by Ousterhout et al. [Ousterhout 85] includes information on seeks (so that read
and  write  data  may  be  derived),  but  lacks  information  we  record  that  allows  references  to  be  classified  by 
file  type  and  file  owner.  We  also  include  directory  and  process  information  not  present  in  their  trace. 

Note  that  our  package  does  not  collect  a  full  trace  of  file  system  activity.  We  don’t  collect  information  on 
inode  accesses,  paging  activity,  internal  file  operations  (except  for  the  total  number  of  bytes  read  and 
written),  or  protection  and  status  related  calls.  However,  our  package  does  generate  detailed  information  on 
the  most  common  operations  on  files  and  directories  as  a  whole  (open,  close,  create,  delete,  execute  and  so 
on)  and  on  overall  read  and  write  activity  for  opened  files.  This  information  provides  a  useful  basis  for 
investigating  file  usage  patterns  and  is  sufficient  for  trace  driven  studies  of  most  DFS's. 

5.  Analysis  Method 
5.1. Basic Approach

The  data  in  the  raw  form  described  in  table  2  is  difficult  to  analyze.  There  is  no  obvious  correspondence 
between  opens  and  closes,  unlinks  are  not  associated  with  the  files  they  affect,  no  direct  information  is 
available  on  process  working  directory  or  owner  and  so  on.  A  library  of  analysis  routines  was  written  to 
address these difficulties. The routines maintain enough state about the system being analyzed to allow the
necessary associations to be made. Alternatives would have been to reformat the file reference logs so that
each record contained the information necessary for its analysis (see, for example, [Zhou 85]) or to collect
more information for each reference. We chose to derive the information at the time of analysis to
minimize the disk resources needed (and so maximize the logging period). Of course, one pays a penalty in
analysis time for doing this. Using this approach, a simple analysis of the trace described in this paper (2.5
million events occupying 70MB of disk) takes about 5 hours of 11/780 CPU time. This is adequately fast
for  our  needs. 

Analysis proceeds in two phases. During the initialization phase, a snapshot of the directories in the system
being analyzed is read in and used to set up a model of the original directory structure. During data
analysis, log records are read and passed, one by one, to user analysis routines. These log records are also
used to update state information on files, directories and processes in the system, creating and destroying
them to maintain an accurate model. Given this up to date state information, the library routines can
perform the associations mentioned above and pass this information on to the user routines.
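As an illustration of the two phases, the sketch below initializes a directory model from a snapshot and then replays log records, tracking each process's working directory so that relative references can be turned into full path names before they reach a user routine. The record layout `(kind, pid, argument)`, the class name `Replay` and its methods are assumptions made for this example; the actual library is not shown in this paper. Symbolic links and chroot are ignored here, and ".." is collapsed lexically.

```python
import posixpath


class Replay:
    """Hypothetical replay loop: keeps per-process state, resolves paths."""

    def __init__(self, snapshot_dirs):
        # Phase 1: model of the original directory structure.
        self.dirs = set(snapshot_dirs)
        self.cwd = {}  # pid -> current working directory

    def resolve(self, pid, path):
        base = self.cwd.get(pid, "/")
        # join() discards `base` when `path` is already absolute.
        return posixpath.normpath(posixpath.join(base, path))

    def feed(self, record, user_routine):
        # Phase 2: update state, then hand file references to the user routine.
        kind, pid, arg = record
        if kind == "fork":            # arg is the child's pid
            self.cwd[arg] = self.cwd.get(pid, "/")
        elif kind == "exit":
            self.cwd.pop(pid, None)
        elif kind == "chdir":
            self.cwd[pid] = self.resolve(pid, arg)
        elif kind == "mkdir":
            self.dirs.add(self.resolve(pid, arg))
        elif kind == "rmdir":
            self.dirs.discard(self.resolve(pid, arg))
        elif kind == "open":
            user_routine(pid, self.resolve(pid, arg))
```

A user routine in this sketch is any callable taking a process number and a resolved path, for example one that appends to a list or updates a histogram.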

There are some conventions worth mentioning here that are used by all analysis programs:

(1)  Calculations  involving  file  sizes  are  always  based  on  the  size  of  the  file  when  it  is  closed  or 
executed. 

(2) File reads and writes are assumed to occur at the time a file is closed (we didn't have more
accurate information on these operations). Since the time most files are open is usually
considerably shorter than any of our histogram resolutions, this has no noticeable effect on our
results.

(3) File lifetimes run from the time a file is created (based on a create flag in the open call) until
the time the underlying inode supporting the file is deleted. This doesn't happen until there are
no links to the file left and there are no active opens, so the delete time can (and frequently
does) differ from the time of the last unlink. File version lifetimes are handled in a similar
fashion.

(4)  Processes  occasionally  open  a  file  and  then  open  it  again  before  closing  it.  This  is  usually  done 
to  get  both  read  and  write  access  to  a  file  without  using  the  mechanisms  for  this  built  into  the 
4.2BSD  kernel.  We  honor  the  intent  (not  the  method)  by  combining  these  opens  into  one  open 
with  both  access  modes.  This  affects  only  0.7%  of  the  opens  and  so  is  not  an  important 
consideration  in  any  case. 
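Convention (3) can be read as a small state machine: a file is deleted only when both its link count and its count of active opens reach zero. The sketch below illustrates that rule under the assumption that a newly created file starts with one link and one active open; the class `FileState` and its methods are hypothetical names, not part of the analysis library.

```python
class FileState:
    """Tracks links and active opens for one file to derive its lifetime."""

    def __init__(self, created_at):
        self.created_at = created_at
        self.links = 1          # the name given in the creating open
        self.opens = 1          # the creating open itself
        self.deleted_at = None

    def open(self):
        self.opens += 1

    def link(self):
        self.links += 1

    def unlink(self, now):
        self.links -= 1
        self._maybe_delete(now)

    def close(self, now):
        self.opens -= 1
        self._maybe_delete(now)

    def _maybe_delete(self, now):
        # The inode goes away only when no links and no active opens remain.
        if self.deleted_at is None and self.links == 0 and self.opens == 0:
            self.deleted_at = now

    def lifetime(self):
        if self.deleted_at is None:
            return None
        return self.deleted_at - self.created_at
```

For a file created at time 0, unlinked at time 5 while still open and closed at time 9, this model reports a lifetime of 9 rather than 5, which is the difference between delete time and last unlink time noted in the convention.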


5.2.  Cuts 

We are interested in investigating both the overall pattern of requests to the file system and the patterns
for various classes of users and files. Past work has often ignored the distinction between batch and
interactive use, system and user files, log and permanent files and so on. We believe that information on the
behavior of each of these file and user classes can be of great value in developing a DFS and have
developed  a  number  of  data  cuts  to  separate  the  classes  of  interest.  We  use  three  basic  types  of  cuts: 

(1) Cuts on the ruid (owner) of processes making requests (UUCP/USENET network, system and
user).

(2) Cuts on the owner of files (UUCP/USENET network, system and user).

(3) Cuts based on the purpose of files (log, permanent, temporary).

Some  of  these  can  be  combined  to  give  other  more  specific  cuts.  14  cuts  are  used  in  this  paper.  The  cuts 
and  their  meanings  are: 

(1)  no  cut:  This  cut  passes  all  records  in  the  log  to  the  user  analysis  routines. 

(2) ruid_NET: Passes references by what we term net processes. Net processes are those running
under UUCP, USENET news or notes accounts. Most of these processes run in batch mode
and so this cut gives us a sample that is considerably different from an interactive one. This
category has been broken out from the system and user categories because of the batch-oriented
nature of the references and the large number of references by net processes (roughly
1/3 of the references in this study and as much as 70% of the non-system references in earlier
studies [Floyd 85]). We don't include references due to Seneca being on the Rochester
Internet in the ruid_NET category.

(3) ruid_SYSTEM: Passes references by system processes (those running under root, daemon,
games and other miscellaneous system accounts). System processes are primarily daemons that
provide widely used services (such as spooling and network status reporting), processes
created on behalf of users to perform privileged operations, and periodic maintenance
processes.

(4) ruid_USER: Passes references by processes running under user accounts.

(5)  owner_NET:  Passes  references  to  files  owned  by  UUCP,  USENET  news  and  notes  accounts. 
These  are  primarily  news  articles  and  UUCP  spool  files. 

(6) owner_SYSTEM: Passes references to files owned by the system accounts mentioned above.
This includes major administrative and status files (for example, /etc/passwd), system libraries,
system include files and so on.

(7) owner_USER: Passes references to user files.


(8) file_LOG: A number of files on any UNIX system are used to keep logs of activity. Examples
include /usr/adm/messages, /usr/adm/wtmp and user mbox files. Since we expect the access
patterns for these files to be considerably different from those for files as a whole and since these
files are generally quite large, we use a cut, file_LOG, that allows us to analyze only these logs.

We  had  originally  intended  to  place  in  this  category  just  those  files  opened  with  append-only 
access.  However,  it  soon  became  clear  that  this  mode  of  access  is  basically  never  used.  Instead, 
most logs are opened write-only, a seek is done to the end of the file and then the log entry is
appended.  If  several  processes  are  trying  to  update  a  log  simultaneously,  the  results  are 
unpredictable.  Some  of  the  busier  logs  on  our  system  are  scrambled  on  a  regular  basis  using 
this  "method." 

We  were  eventually  forced  to  use  the  name  of  the  file  given  in  the  open  call  to  make  this  cut. 
Luckily, most of the log files on the system have well known names and an examination of
source  for  commonly  run  programs  and  of  the  file  reference  logs  enabled  us  to  find  the  rest  of 
the  log  files  on  the  system. 

(9) file_PERM: Passes references to permanent files. This includes all files that aren't log files
(file_LOG) or temporary files (file_TEMP).

(10) file_TEMP: Passes references to temporary files. This includes files that are created on a
special file system (/tmp), temporary spool files, lock files and other such transitory files. Most
temp files are clearly identified by either their name (a special template is usually used to
create temp file names) or by the directory in which they are created.

(11) owner_USER + ruid_USER (shown as U in tables and figures): Passes references that satisfy
both the owner_USER and ruid_USER cuts. These are user references to user files. The
owner_USER + ruid_USER cut produces results similar to the owner_USER cut. There are
about 9.5% fewer references for the U cut, but the resultant distributions are nearly identical. It
is included here for comparison with the next three cuts.

(12) owner_USER + ruid_USER + file_LOG (shown as U + file_LOG in tables and figures): Passes
user references to user log files.

(13) owner_USER + ruid_USER + file_PERM (shown as U + file_PERM in tables and figures):
Passes user references to user permanent files.

(14) owner_USER + ruid_USER + file_TEMP (shown as U + file_TEMP in tables and figures):
Passes user references to user temporary files.
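Each cut is in effect a predicate on a reference, and the combined cuts are conjunctions of the basic ones. The sketch below shows one way this could be expressed. The account sets, the path rules in `file_class` and the dictionary layout of a reference are simplified assumptions for illustration; the rules actually used in the study (file names, templates and directories) are more extensive.

```python
# Hypothetical account and file lists, for illustration only.
NET_ACCOUNTS = {"uucp", "news", "notes"}
SYSTEM_ACCOUNTS = {"root", "daemon", "games"}
LOG_NAMES = {"/usr/adm/messages"}


def account_class(account):
    """Classify an account name as NET, SYSTEM or USER."""
    if account in NET_ACCOUNTS:
        return "NET"
    if account in SYSTEM_ACCOUNTS:
        return "SYSTEM"
    return "USER"


def file_class(path):
    """Classify a file by name as LOG, TEMP or PERM (simplified rules)."""
    if path in LOG_NAMES or path.endswith("/mbox"):
        return "LOG"
    if path.startswith("/tmp/"):
        return "TEMP"
    return "PERM"


def passes(ref, ruid=None, owner=None, purpose=None):
    """True if `ref` satisfies every cut that is not None."""
    checks = [
        (ruid, account_class(ref["ruid"])),
        (owner, account_class(ref["owner"])),
        (purpose, file_class(ref["path"])),
    ]
    return all(want is None or want == got for want, got in checks)
```

With these definitions, cut (12) corresponds to `passes(ref, ruid="USER", owner="USER", purpose="LOG")`, and calling `passes(ref)` with no arguments corresponds to no cut.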

5.3. Analysis Complications

The data analysis did not proceed as smoothly as we had hoped. This section describes some of the
problems we experienced and suggests changes in the data collection and analysis that would help avoid
these problems in the future. None of these problems was serious enough to have a noticeable effect on our
results. 

It was not always possible to pair up opens and closes correctly. In most cases there was only one open for
a given file to associate a close with or the process numbers of an open and close matched. In cases where
this was not true, we looked for an open that was made by an ancestor of the process making the close
request. Sometimes there were multiple opens to a file outstanding among those made by ancestors. This
problem occurred less than 0.01% of the time and was dealt with by using the most recent open by an
ancestor. A more accurate solution would have been to record an open session number in the log and use
this  to  make  the  association,  but  the  low  frequency  of  occurrence  of  this  problem  and  the  relative 
unimportance  of  the  derived  numbers  makes  this  solution  unnecessary  for  us. 
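The pairing heuristic described above can be sketched as follows. This is an illustration under assumed data structures, not the analysis library's code: outstanding opens are taken to be `(time, pid, path)` tuples, `parent` is a hypothetical map from a process number to its parent's, and the function name `match_close` is invented for this example.

```python
def match_close(close_pid, path, opens, parent):
    """Pick the outstanding open that a close record most likely belongs to.

    opens:  list of (time, pid, path) tuples for opens not yet closed
    parent: dict mapping pid -> parent pid
    """
    candidates = [o for o in opens if o[2] == path]
    if len(candidates) == 1:
        return candidates[0]          # only one open to associate with

    same_pid = [o for o in candidates if o[1] == close_pid]
    if same_pid:
        return max(same_pid, key=lambda o: o[0])

    # Otherwise collect the ancestors of the closing process.
    ancestors = set()
    pid = parent.get(close_pid)
    while pid is not None and pid not in ancestors:
        ancestors.add(pid)
        pid = parent.get(pid)

    by_ancestor = [o for o in candidates if o[1] in ancestors]
    if not by_ancestor:
        return None
    # Several ancestors may hold opens; use the most recent one.
    return max(by_ancestor, key=lambda o: o[0])
```

Recording an open session number in each record, as suggested in the text, would replace all of this with a single dictionary lookup.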

There were two peculiarities in the 4.2BSD kernel that resulted in some surprises in the logs. One, having
to  do  with  incorrect  returns  from  the  fork  call,  was  caught  and  corrected  before  the  data  analyzed  here  was 
collected.  The  other,  an  inconsistent  handling  of  error  indicators  when  a  process  was  forcibly  terminated, 
caused  us  to  lose  some  close  and  exit  records.  This  was  not  discovered  until  fairly  late  in  the  analysis.  Less 
than  0.03%  of  the  close  records  and  about  0.5%  of  the  exit  records  were  not  recorded  because  of  this 
problem. Since the number of close records lost was so low, we made no attempt to correct the problem.

Files were classified (as log, perm or temp) the first time they were seen in the logs. Occasionally this
classification  was  incorrect.  While  we  were  developing  the  cuts,  we  classified  a  number  of  files  by  hand, 
using  information  on  the  programs  making  the  requests  and  the  full  history  of  reference  patterns  to  files. 
Comparing  our  classifications  to  those  done  by  the  analysis  routines  (using  file  name  and  directory 
information)  showed  that  only  a  few  tenths  of  a  percent  of  files  were  incorrectly  classified.  It  would  be 
difficult  to  do  better  than  this  without  explicit  information  on  the  intended  usage  of  files.  This  information 
is  just  not  available  under  UNIX. 

We  had  to  retain  a  large  amount  of  state  in  order  to  associate  unlink  records  with  files  and  to  interpret  their 
meaning.  Since  we  needed  most  of  this  state  for  other  reasons  (uid  classification,  directory  studies  and  so 
on)  this  was  not  really  a  problem  for  us.  Including  the  file  id  and  a  count  of  the  number  of  remaining 
references in unlink records would make it possible to interpret them in the absence of the state
information. 


6.  File  Reference  Patterns 

Roughly 7 days of data were collected on Seneca (168.82 hours, from 3:21am on Monday, September 16,
1985 to 4:10am on Monday, September 23). During this period there were 142 active users of the system.
There  were  generally  20  to  30  logged  in  users  at  any  given  time  on  weekday  afternoons,  with  load  averages 
running  between  5  and  10. 

In section 6.1, we examine the overall pattern of open and read/write requests. Section 6.2 briefly examines
execve patterns. Section 6.3 concentrates on user files. Our approach in all cases is to present only those
tables and histograms that are particularly characteristic or striking. A more complete breakdown of many
of  our  results  may  be  found  in  the  appendices. 

6.1. Overall Open and Read/Write Patterns

6.1.1.  Basic  Statistics 

A summary of the records collected is given in table 3. The first three columns give the number of records
of each type collected, the average rate for that type of record, and the percentage of the collected records
that this represents. The remaining columns show the number of records collected cut by the ruid of the
calling process and the percentage of the total for the ruid class.


From this table we can see that each of our ruid categories accounted for roughly 1/3 of the activity on the
system.  The  majority  of  the  file  system  requests  were  for  opens  and  closes,  with  most  of  the  rest  of  the 
categories  being  a  factor  of  5  or  more  down  from  this  (of  course  we  didn't  record  reads,  writes  and  seeks, 
all  of  which  would  be  a  significant  component  of  a  full  trace).  Processes  made,  on  the  average,  5.3  open 
requests. 
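
The rates and fractions in table 3 follow from the raw counts and the 168.82 hour logging period. The
sketch below recomputes a few of them in Python. The function and variable names are ours, and treating
each fork record as one process is an assumption; it happens to reproduce the 5.3 opens per process quoted
above.

```python
# Counts are taken from table 3; all names here are illustrative only.
TRACE_HOURS = 168.82
TOTAL_RECORDS = 2537773
counts = {"open": 965087, "fork": 181511, "unlink": 130929}


def per_hour(count, hours=TRACE_HOURS):
    """Average number of records of one type logged per hour."""
    return count / hours


def fraction(count, total=TOTAL_RECORDS):
    """Share of all logged records, as a percentage."""
    return 100.0 * count / total


for name, count in counts.items():
    print(f"{name:8s} {per_hour(count):8.0f}/hr {fraction(count):5.1f}%")

# Assumption: one fork record corresponds to one process.
opens_per_process = counts["open"] / counts["fork"]
print(f"opens per process: {opens_per_process:.1f}")
```

The same two helper functions apply to any row of the table, which makes it easy to cross-check a row
against the totals.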

Table 4 gives the number of opens to each type of file object on the system. For the purposes of
comparison, the SLAC trace [Porcar 82] included about 237,000 opens to data (regular) files in a similar
period. The remainder of the analysis in this paper deals with only regular files, the largest category in
table 4. Directory access patterns (including explicit directory opens) are analysed in a companion paper
[Floyd 86b]. Block and character special files are used in UNIX to provide access to devices and are not of
interest to us. They are, in any case, a small fraction of the total number of opens.

                             no cut            ruid_NET         ruid_SYSTEM           ruid_USER
record      count  per hr  fraction     count  fraction     count  fraction     count  fraction
mkdir         936     5.5     0.04%                             2        0%       139     0.02%
rename       3211      19     0.13%                           408     0.04%       857     0.11%
rmdir         913     5.4     0.04%               0.11%         0         -       133     0.02%
symlink        16     0.1        0%         0         -         3        0%        13        0%
chdir      136063     806      5.4%     19102      2.6%     71854      7.1%     45106      5.7%
chroot          0       -         -         0         -         0         -         0         -
exit       180270    1070      7.1%     31219      4.2%     85917      8.5%     63133      8.0%
fork       181511    1080      7.1%     29271      4.0%     90735      8.9%     61503      7.8%
setreuid    16772      99     0.66%      4372     0.59%      9698     0.95%      2701     0.34%
close      754072    4470     29.7%    249837     34.0%    298164     29.4%    205666     26.2%
execute    125064     741      4.9%     26761      3.6%     38093      3.8%     60209      7.7%
link        42929     254      1.7%     25694      3.5%      7301     0.72%      9934      1.3%
open       965087    5720     38.0%    277350     37.7%    393661     38.8%    294070     37.4%
truncate        0       -         -         0         -         0         -         0         -
unlink     130929     776      5.2%     68342      9.3%     19861      2.0%     42726      5.4%
total     2537773   15040      100%    735469      100%   1015697      100%    786190      100%

Table 3: records logged


                          no cut            ruid_NET         ruid_SYSTEM           ruid_USER
type                  opens  fraction     opens  fraction     opens  fraction     opens  fraction
regular file         754285     78.2%                        298186     75.7%    206268
directory                       17.7%                6.2%     72625     18.4%     80548     27.4%
block special           922                             -               0.02%       862
character special     39432      4.1%                3.7%                5.8%      6392      2.2%
total                965087      100%    277350              393661      100%    294070

Table 4: Opens, by object type


Opens  may  be  further  broken  down  by  the  class  of  file  being  opened  and  by  the  owner  of  the  file.  This 
information,  plus  statistics  on  how  many  files  there  are  in  each  category,  is  given  in  table  5.  We  see  here 
that  2/3  of  the  references  were  to  perm  files,  although  temp  files  made  up  4/5  of  the  files  referenced. 
Relatively  few  references  were  made  to  user  files.  The  large  number  of  net  files  may  be  attributed  to  a  daily 
news  expiration  procedure  that  reads  the  headers  of  all  news  articles. 

Information on read/write modes for open-close sessions is given in table 6 (note that percentages in this
table sum horizontally). Overall, file opens were evenly split between opens with read-only access and
opens for write-only or read/write. Users, however, opened most files read-only. Log files were generally
opened write-only.

Perm files are categorized by their function in table 7. This categorization was done using the directories
that files appeared in and/or based on file names and extensions. "System configuration" files are those
appearing in / and /etc. Examples are /vmunix (the bootable kernel image) and /etc/passwd (passwords
and other information on accounts). "Rwho daemon" files are used to maintain status information about
machines on the network. "Library" files are those in /lib, /usr/lib and so on (this includes both program
libraries and additional configuration files). Files with names beginning with "." are grouped into the
category "personal configuration." These files traditionally contain startup commands and status
information for various programs and are used to tailor and maintain an individual's environment.


cut               opens  fraction     files  fraction  opens/file
file_LOG          35662      4.7%                0.5%
file_PERM        499193     66.2%     16352     16.2%
file_TEMP        219430     29.1%     84327     83.3%         2.6
owner_NET        249733     33.1%     46207     45.7%         5.4
owner_SYSTEM     392790     52.1%     25062     24.8%        15.7
owner_USER       111762     14.8%     30822     30.5%         3.6
no cut           754285      100%    101185

Table 5: Class and owner of opened regular files


                    read-only           write-only          read/write         total
cut               opens  fraction     opens  fraction     opens  fraction      opens
file_LOG            735      2.1%     34819     97.7%        97      0.3%      35651
file_PERM        282853     56.7%    180976     36.3%     35200      7.1%     499029
file_TEMP        104828     47.8%     96766     44.1%     17794      8.1%     219388
owner_NET        148150     59.3%     79739     31.9%     21830      8.7%     249719
owner_SYSTEM     175787     44.8%    198183               18712      4.8%     392682
owner_USER        64479     57.7%     34639     31.0%     12549     11.2%     111667
ruid_NET         146993     58.8%     79111     31.7%     23713      9.5%     249817
ruid_SYSTEM       99205     33.3%    188233     63.1%     10723      3.6%     298161
ruid_USER                   69.0%     45189     22.0%     18654      9.1%
no cut           388416     51.5%    312561     41.4%                7.0%

Table 6: Mode of open for open-close sessions

Examples include .login, .profile and .newsrc. The rest of the categories have the obvious meaning. Note
that over half of the opens to perm files were made to 0.7% of the files (those in the first two categories).
These files were basically all system configuration and status files. Activity to these two categories represents
roughly 40% of the total file opens we observed, indicating that a substantial fraction of the system activity
was devoted to communicating and maintaining information about itself and about other hosts on the
network.
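
A categorization by directory and file name, as described for table 7, might be sketched as follows. This
is not the classifier used for the study: the function name and the rule table are hypothetical, and only
the categories whose directories or name patterns are stated in the text are covered.

```python
import posixpath

# Hypothetical rule table. Only directories named in the text are listed;
# the remaining table 7 categories would need rules of their own.
DIRECTORY_RULES = [
    ("/etc/", "system configuration"),
    ("/lib/", "library"),
    ("/usr/lib/", "library"),
]


def classify_perm_file(path):
    """Assign a perm file to a table 7 style category by directory or name."""
    name = posixpath.basename(path)
    directory = posixpath.dirname(path)
    if name.startswith("."):
        return "personal configuration"   # e.g. .login, .profile, .newsrc
    if directory == "/":
        return "system configuration"     # e.g. /vmunix
    for prefix, category in DIRECTORY_RULES:
        if path.startswith(prefix):
            return category
    return "other"
```

For example, `classify_perm_file("/etc/passwd")` falls under "system configuration", while a path such as
`/u/someone/.newsrc` (a made-up home directory) is classified by its leading dot.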

6.1.2.  Per  Open  Results 

The open activity over time is shown in figure 1. Opens followed a daily pattern with a busy period
between 9am and 6pm, overlaid by strong bursts due to net activity (mostly news expiration and news
reception). Weekends were relatively quiet.

Figure 2 plots the open activity for just the first day of the trace. This shows the work day busy period
more clearly. Looking closely, we can see that user activity accounted for roughly half of the daytime load.
System opens had a base level (the rwho daemon) overlaid by activity that followed or lagged slightly
behind user and net activity. That is, a significant part of the system activity was indirectly due to the other
classes. This activity may be attributed to logins, spoolers, mailers and so on.

The read and write activity to regular files corresponded only roughly to the open activity. This can be seen
by comparing figures 3 and 4 with figure 1*. Reads and (especially) writes were fairly bursty on the
resolution used in these figures (about 2 hours). The burstiness increased as the resolution was made finer.
Figure 5 shows the throughput of the file system during a typical period of heavy user activity, averaged
over 10 second intervals. This represents activity for about 25 logged in users. It is interesting to note that
the peak rates in this figure, 35K bytes/second, would present little problem for today's LAN technologies,
even with fairly hefty open and transfer protocol overheads. Our results here are similar to those presented
by Ousterhout et al. [Ousterhout 85] and support their contention that such networks can support large
numbers of users.


category                   opens  % opens    files  % files  opens/file
system configuration      123481    24.7%      100     0.6%        1235
rwho daemon               166761    33.4%       13     0.1%       12830
library                    59245    11.9%      222     1.4%         267
manual pages               18371     3.7%     1597     9.8%        11.5
news                       40022     8.0%     5828    35.6%         6.9
program source             10596     2.1%     1499     9.2%         7.1
includes                   13767     2.8%      344     2.1%          40
objects                     5618     1.1%      468     2.9%          12
personal configuration     23125     4.6%     1676    10.2%        13.8
mail spool                  3621     0.7%      524     3.2%         6.9
other                      34586     6.9%     4081    25.0%         8.5

Table 7: Function of opened perm files


*The unusually heavy read/write activity on Thursday was caused by repeated execution of a large user text
formatting job (formatting a Ph.D. dissertation). Most of the activity was to temp files.


Figure 1: Average number of regular file opens per second (~2 hour resolution). Axes: time of open
(Tue through Mon) against average opens per second; curves for ruid_USER, ruid_NET, ruid_SYSTEM and no cut.

Figure 2: Average number of regular file opens per second (~15 minute resolution)

Figure 3: Bytes read from regular files (~2 hour resolution)

Figure 4: Bytes written to regular files (~2 hour resolution). Axes: time of close (Tue through Mon)
against kbytes written per second; curves for ruid_USER, ruid_NET and ruid_SYSTEM.

Figure 5: Bytes transferred to and from regular files (10 second resolution)


Table  8  shows  the  average  throughput  of  the  file  system  over  the  life  of  the  trace  for  each  class  of  user. 
Note  that  reads  accounted  for  84%  of  the  bytes  transferred.  Users  accounted  for  over  half  of  all  bytes 
transferred,  even  though  they  made  only  about  a  quarter  of  the  opens  to  regular  files  (table  4). 

Referenced files on Seneca tended to be small, particularly compared to IBM mainframe environments such
as the ones studied by Porcar. Figure 6 and table 9 show file size distributions on Seneca, weighted by the
number of opens made and cut by the class of file. Note that these are cumulative distributions. At any
point on a curve, the y value is the fraction of files with sizes less than or equal to the x value. For
comparison purposes, we have included here the static file size distribution, as derived from a snapshot of
the file system taken at the beginning of data collection (this is the distribution that would result if each file
on the system were opened once). Table 9 also includes statistics for on-disk permanent files referenced
during the SLAC trace.

Table 8: Bytes read/written for regular files

Figure 6: Dynamic file size distributions (cumulative, measured at close)

Table 9: File size distributions


From figure 6 we see that there were substantial size differences between opened log, perm and temp files.
The large number of zero length temp files was due to frequent creation of lock files (these lock files serve
as a very crude mutual exclusion mechanism). Log files, on the other hand, were generally an order of
magnitude or more bigger than other files. The jump at 60 to 100 bytes in the perm file distribution was
due to the rwho daemon, which was updating a set of status files describing machines in the network every
60 seconds. By comparing the dynamic and static distributions, we find that opens tended to favor small
files (due to lock and rwho daemon files) and, to a lesser extent, a few larger files (administrative files such
as /etc/passwd).

The small size of opened files (55% are under 1024 bytes, a common block transfer size, and 75% are under
4096 bytes) suggests that directory lookup and open overhead will play a large part in file access times,
particularly in a distributed environment.
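
The dynamic distributions discussed here are cumulative distributions in which each file size is weighted,
for example by the number of opens. A minimal Python sketch of that calculation follows. The function
names and the three-file example are made up, and the byte weighting in the example assumes every open
transfers the whole file.

```python
def weighted_cdf(sizes, weights):
    """Return sorted (size, cumulative fraction) pairs.

    Each size contributes its weight (for example the number of opens,
    or the number of bytes read) to the distribution.
    """
    total = float(sum(weights))
    merged = {}
    for size, weight in zip(sizes, weights):
        merged[size] = merged.get(size, 0) + weight
    cdf, running = [], 0.0
    for size in sorted(merged):
        running += merged[size]
        cdf.append((size, running / total))
    return cdf


def fraction_at_or_below(cdf, limit):
    """Fraction of the total weight at sizes less than or equal to limit."""
    result = 0.0
    for size, cumulative in cdf:
        if size > limit:
            break
        result = cumulative
    return result


# Made-up example: three files with sizes in bytes and open counts.
sizes = [80, 2000, 21000]
opens = [6, 3, 1]
by_opens = weighted_cdf(sizes, opens)
# Assumption for the example only: every open transfers the whole file.
by_bytes = weighted_cdf(sizes, [s * n for s, n in zip(sizes, opens)])
```

In this toy data most opens go to small files while most bytes come from the large one, which is the same
qualitative contrast the trace shows between open-weighted and byte-weighted distributions.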




While most files opened in our environment were small, the majority of bytes came from files that were
much larger; 2/3 of all bytes were read from files greater than 20,000 bytes long. This is shown by figure 7
and table 10, which give distributions for the size of opened files, weighted by the number of bytes read.
We have also included here, for comparison purposes, the static space used distribution (the distribution
that would result if each file on the system were completely read once). The staircase effect in the dynamic
distributions is due to repeated opens and reads of a few large administrative files. /etc/passwd, for
example, at 21,000 bytes, accounts for almost 20% of the bytes read. This file is infrequently modified and
so would be a good candidate for replication in a distributed environment. We saw earlier (table 7) that a
relatively small number of files received a high fraction of the open traffic. Figure 7 gives graphic evidence
of the corresponding impact on I/O traffic.

Our distributions for the overall sizes of opened files and for the source of bytes read (figures 6 and 7)
agree with the distributions found by Ousterhout et al. By these measures, at least, our data appears to be
representative of a university research environment.

Figure 7: Dynamic file size distributions, weighted by bytes read (cumulative, measured at close)

distribution            min      max     mean   median  std deviation
file_LOG, dynamic             1.28e6   1.91e5    1.2e5          1.6e5
file_PERM, dynamic        0   2.49e6   1.19e5    2.2e4          2.0e5
file_TEMP, dynamic        0    1.3e6   1.12e5    6.8e4          1.5e5
all, dynamic              0   2.49e6   1.19e5    3.4e4          1.9e5
all, static               0   7.95e6    3.6e4   1.29e4          5.1e4

Table 10: File sizes, weighted by number of bytes read


Two figures that are useful in estimating the appropriateness of dynamic migration are the fraction of a file
opened for reading that is actually read and the fraction of a file opened for writing that is actually written.
As mentioned in section 4, we don't have complete information on which bytes of opened files were read
and written. However, if we make the reasonable (for our environment) assumption that a given byte in a
file was not usually read or written repeatedly in a single session, we can use the counts of bytes read and
written from the close record to calculate the fraction of a file read or written. Figure 8 shows the
percentage read for files opened read-only, cut by the class of the file. Figure 9 shows the percentage
written for files opened write-only, and figures 10 and 11 are for files opened read/write. In all cases, the
size used is the size of the file when closed. Zero length files are omitted. Tables 11-14 provide some
statistics on the distributions in these figures.
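
The calculation described above can be sketched in a few lines of Python. The function name and the
session tuples are hypothetical, and capping the result at 1.0 is our own choice for sessions in which the
no-repeated-transfer assumption fails.

```python
def fraction_accessed(bytes_transferred, size_at_close):
    """Estimate the fraction of a file read (or written) in one session.

    Assumes no byte is transferred more than once in the session, so the
    byte count from the close record can be divided by the file size.
    Returns None for zero length files, which the analysis omits.
    """
    if size_at_close == 0:
        return None
    # Our choice: cap at 1.0 in case bytes were in fact transferred twice.
    return min(1.0, bytes_transferred / size_at_close)


# Made-up sessions: (bytes read, file size at close).
sessions = [(1024, 1024), (100, 4096), (0, 0), (2048, 2048)]
fractions = [fraction_accessed(b, s) for b, s in sessions]
measured = [f for f in fractions if f is not None]
completely_read = sum(1 for f in measured if f >= 1.0) / len(measured)
```

Here two of the three non-empty sessions read the whole file, so `completely_read` is 2/3.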

From these figures we see that most opens with read-only or write-only access resulted in the file being
completely read or written. The notable exception was for log files. For these files, writes usually just
incrementally extended the file. This is shown clearly in figure 9 and indicates that we have successfully
extracted log files from our data. Much less can be said about the read/write behavior of files opened with
read/write access. For these files, information on usage history or more detailed information on the
intended usage of the file would be needed to predict the read/write behavior. Recall (table 6) that this
category represents only 7% of the opens and so the additional information will not usually be needed.

Overall, 68% of files opened with read access (read-only or read/write) were completely read and 78% of
files opened with write access (write-only or read/write) were completely written. This may be contrasted
with the SLAC data, where only 17% of opened permanent files were completely accessed. The high
percentage of files completely accessed on Seneca is due to the much smaller file size and to the lack of any
serious database activity.

As one might expect, the fraction of a file that was accessed depended strongly on the size of the file. Very
small files were usually completely read or written. Large files were rarely completely read or written. This
is shown for files opened read-only and write-only in figures 12 and 13 and in tables 15 and 16. Files
opened with read/write access followed a similar pattern.

6.1.3. Per File Results

Figure 14: Number of opens per active file (cumulative)

Table 17: Number of opens/file


The number of opens per file gives an indication of the potential benefits and penalties of migrating files to
a user's machine (the degree of sharing is also a factor here). Most files in our environment were opened
only once or twice (figure 14 and table 17). This may be attributed to the large number of lightly used
temp files; log and perm files saw considerably more activity. The low number of opens for most files
suggests that the initial placement of a file is an important consideration. We have also included in table 17
information on the distribution for on-disk permanent files in the SLAC trace (for a period of 310 hours).
SLAC perm files saw, on average, considerably less activity than the perm files in our environment, despite
the longer SLAC logging period.

Figure 16: Times from file open to close (cumulative)

distribution    min     max    mean  median  std deviation
file_LOG          0   8.6e4    33.4    0.08           1140
file_PERM         0   7.6e4     6.5    0.08            251
file_TEMP         0   4.8e4    20.5    0.22            335
no cut            0   8.6e4    11.8     0.1            369

Table 19: Open time (seconds)


Files in our environment were usually only open for a few tenths of a second (figure 16 and table 19).
Temp files were open for relatively long periods of time. This is to be expected, since they are often used to
store intermediate results as they are being calculated. The distribution for perm files is consistent with the
small file sizes and whole file transfers we saw earlier. Programs open these files, transfer data and then
immediately close the files.
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5.4e5 
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9.8e3 
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file_TF.MP 
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4.2e5 

6655 

3.6 

1.8e4 
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2.2e4 
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9.7e4 
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Table  20:  File  interopen  intervals  (seconds) 


Knowledge of file interopen intervals (the time from one open of a file to the next) is useful in estimating
both the appropriate time scale for migration and the possibilities for caching. Figure 17 and table 20 show
that interopen intervals in our environment were short (opens to a file were strongly clustered). When a file
was opened, the following open (if any) had a 50% probability of occurring within the next 60 seconds.
Interopen intervals for temp files were particularly short. If a temp file was opened multiple times (many
were not), the next open often occurred within a few seconds of the last one. This is to be expected for files
that are used to hold results between job steps. Log files also had shorter interopen intervals than files as a
whole. Most log file opens were made by net processes and these processes show intense bursts of activity
(figure 2), so this is not surprising. The jump at 60 seconds in the distribution for perm files is due to rwho
daemon activity.
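
Interopen intervals can be derived from nothing more than a time-ordered list of open records. The Python
sketch below assumes a hypothetical record layout of (timestamp in seconds, file id); the trace format
itself is not reproduced here.

```python
from statistics import median


def interopen_intervals(open_records):
    """Return the times between successive opens of the same file.

    open_records is an iterable of (timestamp_seconds, file_id) pairs.
    Files opened only once contribute no interval.
    """
    last_open = {}
    intervals = []
    for timestamp, file_id in sorted(open_records):
        if file_id in last_open:
            intervals.append(timestamp - last_open[file_id])
        last_open[file_id] = timestamp
    return intervals


# Made-up records: file "a" is opened every 60 seconds, "c" only once.
records = [(0, "a"), (60, "a"), (61, "b"), (120, "a"), (125, "b"), (500, "c")]
intervals = interopen_intervals(records)
typical = median(intervals)
```

A file rewritten once a minute, as the rwho daemon files were, contributes a run of 60 second intervals,
which is how a jump at 60 seconds can arise in the cumulative distribution.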


The lifetime of a file in our environment depended strongly on the class of the file. Most temp files lived
less than a minute. The overwhelming majority of perm files had lifetimes that extended beyond the
logging period. Log files fell in between (mostly due to short lived UUCP work logs). File lifetime
distributions are shown in figure 18. Here files that existed before logging was started or that continued to
exist after logging was terminated were given lifetimes exceeding the logging period (lie to the right of the
histogram). Because so many log and perm files fell into this category, we have not included the moments
of these distributions.

Even  though  most  perm  files  have  long  happy  lifetimes,  the  data  in  these  files  is  not  so  fortunate.  This  is 
shown  in  figure  19,  where  we  have  histogrammed  the  time  from  when  a  file  is  created  or  written  to  the 
time  when  a  file  is  overwritten  or  deleted  (this  is  the  file  lifetime  used  by  Ousterhout  et  al.).  Files  that  were 
only  partially  written  are  not  included  in  this  histogram.  Again,  data  whose  lifetime  extended  beyond  the 
limits  of  our  log  were  given  lifetimes  exceeding  the  logging  period.  The  large  jump  at  60  seconds  is  due  to 
rwho  daemon  activity.  Since  we  include  all  files  here  and  Ousterhout  et  al.  included  just  new  data,  our 
results  are  not  directly  comparable. 

The first two columns of table 21 show the mean number of readers per file, as indicated by the account
(ruid) of the reader, and the percentage of files with more than one reader, cut by the file class and owner.
The next four columns show this information for writers and for the overall number of file users. The last
two columns show the mean and maximum number of inversions per file. The number of inversions is the
number of times that the most recent user of the file changes (this is basically the inversion clustering
metric used by Porcar [Porcar 82]). For a file used by only one user, the number of inversions will be zero.
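
The inversion count defined above is simple to compute from the sequence of users that touched a file.
A minimal Python sketch, with a function name of our own choosing:

```python
def count_inversions(users_in_order):
    """Count how often the most recent user of a file changes.

    users_in_order lists the user (ruid) of each access to one file,
    in time order.
    """
    inversions = 0
    previous = None
    for user in users_in_order:
        if previous is not None and user != previous:
            inversions += 1
        previous = user
    return inversions
```

A file used by a single account gives zero inversions however often it is opened, while two accounts that
strictly alternate give one inversion per access after the first. The metric therefore separates files that
are passed back and forth from files that merely have more than one user.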
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Figure  18:  File  lifetimes  (cumulative,  files  living  beyond  log  period  binned  at  right) 
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Figure  19:  Version  lifetimes  (cumulative,  versions  living  beyond  log  period  binned  at  right) 


We can see from table 21 that 11.8% of the files seen during the logging period were accessed by multiple
users (users with separate accounts). Multiple readers were much more common than multiple writers.
Most shared files belonged to net. These were predominately news articles (perm files). Logs were also
heavily shared. They frequently had multiple writers and separate readers. Although system files were not
as heavily shared as net files, in terms of the number of shared files, the high mean number of inversions
(2.92) indicates that the system files that were shared were not shy about it. Few user files were shared. The
low mean number of inversions (0.111) indicates that this sharing was incidental to the normal use of user
files.

Table 21: File sharing, by file class and owner


The overall distributions are shown in more detail in table 22. Note that very few files had more than 2
writers and that even the distribution of the number of users per file drops off quite sharply.

              readers            writers             users           inversions
number     count     cum     count     cum     count     cum     count     cum
0          44711   44.2%     11252   11.1%         -       -     89272   88.2%
1          49845   93.4%     87510   97.6%     89272   88.2%      5838   94.0%
2           3022   96.4%      2009  99.59%      7555   95.7%      2182   96.2%
3           1244   97.7%       209  99.80%      1818   97.5%       666   96.8%
4            685   98.3%        66  99.86%       758   98.2%       799   97.6%
5            421   98.8%        23  99.89%       448   98.7%       389   98.0%
6            356  99.11%        18  99.90%       369  99.05%       339   98.3%
7            245  99.35%        20  99.92%       258  99.30%       318   98.6%
8            167  99.52%         8  99.93%       175  99.47%       291   98.9%
9             80  99.60%        12  99.94%        88  99.56%       213  99.13%
10           105  99.70%         8  99.95%       111  99.67%       156  99.29%
>10          304    100%        50    100%       333    100%       722    100%
total     101185       -    101185       -    101185       -    101185       -

Table 22: readers, writers, users and inversions; no cuts


6.2.  Execute  Patterns 

The basic calls to run an executable file under 4.2BSD UNIX are execv and execve. These calls are grouped
together under the heading "execute" in table 3. Users were responsible for half of the execute requests in
our log (table 23), even though, as we saw in section 6.1, they made only a quarter of the opens to regular
files. Most executes were done on system files. Users owned almost half of the executables seen but there
were few executes of these files.


cut             executes  % executes  executables  % executables  executes/executable
ruid_NET           26761       21.4%           41           7.1%                  653
ruid_SYSTEM        38093       30.5%          137          23.6%                  278
ruid_USER          60210       48.1%          528          90.9%                  114
owner_NET          12190        9.7%           34           5.9%                  359
owner_SYSTEM      108646       86.9%          291          50.1%                  373
owner_USER          4228        3.4%          256          44.1%                   17
no cut            125064        100%          581           100%                  215

Table 23: Basic active executable statistics


Figure 20 (y axis: fraction of executes)

distribution     min     max    mean  median  std deviation
owner_NET       9216   8.8e4   44400   35500          23900
owner_SYSTEM    4096   1.1e6   34500   21400          84900
owner_USER      4228   3.2e6   55900   28200         135000
no cut          4096   3.2e6   36200   22400          83400

Table 24: Executable file sizes (bytes)


Most executable files were between 5,000 and 100,000 bytes long (figure 20 and table 24). The relatively
large size of executables is a reflection of the lack of run-time library sharing. All executables contain
whatever code they need to run.

Figure 21 (y axis: fraction of executables)

distribution    min     max   mean  median  std deviation
owner_NET         1    2152    359      60            570
owner_SYSTEM      1   17519    373      24           1340
owner_USER        1     675     17       3             59
no cut            1   17519    215       8            970

Table 25: Number of executes/active executable


An executable file saw considerably more activity than other regular files (figure 21 and table 25). Almost
half were executed 10 times or more. This is not surprising, considering the small number of active
executables.


Figure 22: Fraction of executes per active executable (cumulative). Axes: number of executes to file
against fraction of executes; curves for owner_USER, owner_NET, owner_SYSTEM and no cut.

distribution     median  std dev
owner_NET
owner_SYSTEM
owner_USER

Table 26: execute distribution (as a function of executes/executable)


Most executes went to files executed a large number of times. Half went to files executed more than 2000
times (figure 22 and table 26) and 95% went to files executed at least 100 times.

Figure 23 (y axis: fraction of intervals)

distribution     min     max   mean  median  std deviation
owner_NET       0.05   2.2e5   1160      65           6550
owner_SYSTEM       0   5.6e5    983      40           7680
owner_USER      0.52   4.1e5   6290     730          23600
no cut             0   5.6e5   1170      47           8610

Table 27: Interexecute intervals (seconds)

The most frequently executed files on Seneca were shells and system utilities to delete files, evaluate
conditionals, list directories and distribute files to other machines (table A-2 in appendix A). Over half of
the executes went to only 13 files. These files, taken together, occupied 0.46MB of disk space (0.08% of the
total). This suggests that even a very modest amount of caching or other special treatment for frequently
requested programs will produce significant improvements. Evidence for this was also seen in a study of
2MB diskless Sun workstations running a version of UNIX similar to the one on Seneca at the University
of Washington [Lazowska 84]. For the Suns studied, 80% of the bytes transferred were due to file accesses
and only 20% were for paging. If we take our average executable size times the execute rate (most 4.2BSD
executables are loaded using demand paging), we get a very crude paging estimate of 7500 bytes/second, or
about 170% of the transfers due to opens (table 8). The difference between our crude estimate and the
behavior seen at the University of Washington is probably due to both the caching of pages of frequently
executed files and to code and debugging information in executables that is not used.
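
The crude paging estimate can be reproduced from figures given elsewhere in this section. The short
Python calculation below assumes that the "no cut" mean in table 24 is the average executable size meant
in the text; the variable names are ours.

```python
TRACE_SECONDS = 168.82 * 3600      # length of the logging period
EXECUTES = 125064                  # execute records, table 3
MEAN_EXECUTABLE_BYTES = 36200      # "no cut" mean size, table 24 (assumed)

execute_rate = EXECUTES / TRACE_SECONDS            # executes per second
paging_estimate = MEAN_EXECUTABLE_BYTES * execute_rate
print(f"{execute_rate:.3f} executes/s, about {paging_estimate:.0f} bytes/s")
```

The result lands slightly under 7500 bytes/second, consistent with the rounded value quoted above. It is
an upper bound in spirit, since it charges every execute with loading the whole file.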


The distribution of time between executes for executables is given in figure 23 and table 27. These
distributions lend support to our caching arguments (at least for selected executables owned by net and


Figure 24: Process lifetimes (cumulative). x axis: process lifetime (seconds).

distribution     min      max   mean  median  std deviation
ruid_NET        0.02     7250   15.7     3.0             88
ruid_SYSTEM     0.01   215000   52.2    0.09           1380
ruid_USER       0.02    76100    118     2.4           1190
no cut          0.01   215000    165    0.95           2560

Table 28: Process lifetimes (seconds)


Executing a program on UNIX is usually done using the sequence fork (to create a copy of the running
process), execv or execve (to replace that copy with the new program), and exit (when done). Since over 2/3 of
the forks on Seneca were followed by an execute, we can, by looking at process lifetimes (time from fork to
exit), estimate how long executables were in use. Process lifetime distributions, cut by the ruid of the
requester, are given in figure 24 and table 28*. Over half of all processes recorded in the log lived less than
a second. System processes were particularly short-lived. With the exception of the large number of system
processes that lived less than a tenth of a second (due mostly to local network servers), our results agree
with process lifetime results given by Zhou et al. [Zhou 85].
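
The fork, exec, exit sequence described above can be sketched in a few lines. This is only an illustration, not the instrumentation used on Seneca: the function name run_and_time is hypothetical, and the lifetime is measured in the parent (from just before the fork until the child has been reaped), so it slightly overestimates the fork-to-exit time recorded in the log.

```python
import os
import time


def run_and_time(argv):
    """Fork, exec argv in the child, and return (lifetime_seconds, exit_code)."""
    start = time.monotonic()
    pid = os.fork()                  # child starts as a copy of the running process
    if pid == 0:
        try:
            os.execv(argv[0], argv)  # replace the copy with the new program
        finally:
            os._exit(127)            # reached only if the exec itself failed
    _, status = os.waitpid(pid, 0)   # parent blocks until the child has exited
    lifetime = time.monotonic() - start
    code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return lifetime, code


if __name__ == "__main__":
    print(run_and_time(["/bin/sh", "-c", "exit 0"]))
```

On a UNIX system a call such as run_and_time(["/bin/sh", "-c", "exit 0"]) returns a lifetime of a few milliseconds, which is the same order of magnitude as the short system process lifetimes reported in table 28.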


*Some processes, such as login shells, start life in one ruid class and exit in another. These are included only in the overall distribution.

Executable files were more heavily shared than opened files (table 29). This should come as no surprise,
since there were relatively few executables and they were usually located in public directories. User
executables were relatively lightly shared.


                executors                                inversions
cut             mean    median   >1      >5      max     mean    >0      >5      max
owner_NET       IB      4        70.6%   41.2%   45      92.9    70.6%   55.9%   957
owner_SYSTEM    111     2        61.5%   34.4%   111     126     61.5%   38.8%   6539
owner_USER      1.38    1        9.8%    1.6%    28      1.74    9.8%    4.7%    205
no cut          7.2     1        39.2%   20.3%   111     69.5    39.2%

Table 29: Executable sharing


6.3.  User  File  Patterns 

In  this  section  we  take  a  closer  look  at  user  files.  Many  distributed  file  systems  (including  Roe  [Ellis  83]  and 
the  ITC  DFS  [Satyanarayanan  85])  deal  primarily  or  wholly  with  user  files.  In  addition,  we  expect  that  user 
file  access  patterns  will  be  less  dependent  on  the  operating  system  used.  These  factors  make  user  file 
reference  patterns  particularly  interesting. 

The results presented in this section are actually for user references to user files
(owner_USER + ruid_USER cut, referred to as the "U" cut below). These references represented over 90%
of the references to user files. The remaining references were mostly infrequent periodic references made
by system processes and had little effect on the distributions we see (with the exception of some of the
sharing results). The organization of this section follows closely that of section 6.1.

6.3.1.  Basic  Statistics  for  User  Files 

The  majority  (62%)  of  user  references  to  user  files  were  to  perm  files  (table  30),  even  though  less  than  a 
third  of  the  referenced  user  files  were  perm  files.  There  were  few  references  to  log  files.  Most  of  these  files 
were  logs  of  mail  sent  or  read  and  so  the  low  level  of  activity  is  not  surprising.  With  the  exception  of  a 
somewhat  higher  proportion  of  perm  files,  these  figures  agree  with  what  we  saw  for  the  overall  distributions 
(table  5). 


cut              opens     % opens   files    % files   opens/file
U + file_LOG     837       0.8%      101      0.3%      8.3
U + file_PERM    65051     62.4%     8662     29.0%     7.5
U + file_TEMP    38420     36.8%     21127    70.7%     1.8
U                104308    100%      29890    100%      3.5

Table 30: User opens to user files

                 read-only           write-only          read/write          total
cut              opens    fraction   opens    fraction   opens    fraction   opens
U + file_LOG     117      14.0%      623      74.4%      97       11.6%      837
U + file_PERM    50193    77.5%      13296    20.5%      1310     2.0%       64799
U + file_TEMP    7349     19.1%      19891    51.8%      11140    29.0%      38380
U                57659    55.4%      33810    32.5%      12547    12.1%      104016

Table 31: Modes of open for user open-close sessions to user files


category                 opens    % opens   files   % files   opens/file
library                  2036     3.1%      91      1.1%      22.4
manual pages             776      1.2%      181     2.1%      4.3
program source           10538    16.2%     1486    17.1%     7.1
includes                 3093     4.8%      306     3.5%      10.1
objects                  5617     8.6%      467     5.4%      12.0
personal configuration   20278    31.2%     1638    18.9%     12.4
mail spool               2049     3.1%      453     5.2%      4.5
Other                    20644    31.7%     4040    46.6%     5.1

Table 32: Function of opened user perm files


55% of the opens were read-only, with most of the read-only opens going to perm files (table 31). Users
showed a strong tendency to open perm files read-only and other files write-only or read/write.

30% of the activity to perm files was to program development files ("program source," "includes," and
"objects" in table 32). A similar number of references were to personal configuration files (often referred to
as "dot files"). Most of the rest of the references were unidentifiable.


                   reads                 writes                overall (r+w)
cut                bytes/sec  fraction   bytes/sec  fraction   bytes/sec  fraction
U + file_LOG       9.6        1.0%       4.6        1.1%       14.2       1.0%
U + file_PERM      401        41.0%      121        28.6%      522        37.3%
U + file_TEMP      568        58.0%      297        70.4%      865        61.7%
U                  978        100%       423        100%       1401       100%
no cut (table 8)   4190       -          800        -          4990       -

Table 33: Bytes read/written by users to user files

6.3.2.  Per  Open  Results  for  User  Files 

User open activity to user files (figure 25) showed a busy period during the work day, with activity tapering
off in the late evening. This is typical of a university environment. There was some early morning activity
due to user background jobs. The overall level of activity was much less than what we saw for the system as
a whole (user opens to user files accounted for 14% of the open activity) and was generally less bursty.

User reads and writes to user files accounted for 28% of the bytes transferred during the logging period.
Most of the transfers (61.7%) were to and from temp files (table 33). Few bytes were transferred to or from


file size (bytes)

Figure 26: Dynamic file size distributions (cumulative, measured at close, U cut)


distribution


std deviation


U + file_LOG, dynamic
U + file_PERM, dynamic
U + file_TEMP, dynamic
U, dynamic


all, dynamic (table 9)
all, static


1.28e6

2.49e6

1.30e6

2.49e6


2.49e6

7.95e6


Table 34: User file size distributions


Cumulative file size distributions for user files, weighted by the number of user opens and cut by the file
class, are given in figure 26 and table 34. Referenced user files were, on average, smaller than other
referenced files. This was due, in part, to the large number of zero length temp files and to the absence of
the large, frequently accessed administration files seen in the overall data.

Users accessed most of their files completely (figures 27-30 and tables 35-38). 90% of opens with read access
(read-only or read/write) resulted in the file being completely read (compared to 68% for the system as a
whole). 83% of files opened with write access were completely written (compared to 78% for the system as a
whole). Nearly all files opened read-only were completely read.


percent  read 


I 


distribution 


U  +  file_LOG 
U  +  file_PERM 
U  +  file_TEMP 
U 


no  cut  (table  11) 


mean  median  std dev  <100%


75.3 

100 

40 

28% 

0% 

99.9 

100 

110 

5.7% 

5.2% 

109 

100 

47 

7.2% 

11.4% 

12500 

100.9 

100 

104 

5.9% 

6.0% 

64100 

83.9 

100 

202 

31% 

2.9% 

Table 35: Percentage of user files read (read-only opens)


distribution


U + file_LOG
U + file_PERM
U + file_TEMP
U


median  std dev


100  12.4

200  90.8

9600  106.4

9600  95.6


<100%  >100%


0%

0.8%

0.7%

0.8%


no cut (table 12)


Table 36: Percentage of user files written (write-only opens)


distribution


U + file_LOG
U + file_PERM
U + file_TEMP


mean  median  std dev  <100%  >100%


65000


no cut (table 13)


Table 37: Percentage of user files read (read/write opens)


distribution


U + file_LOG
U + file_PERM
U + file_TEMP
U


no cut (table 14)


mean  median  std dev  <100%  >100%


Table 38: Percentage of user files written (read/write opens)


number of opens to file

Figure 31: Number of opens per active file (cumulative, U cut)


                                       opened   opened   opened more
distribution         mean    median    once     twice    than twice    max
U + file_LOG         8.3     5         13%      18%      69%           79
U + file_PERM        7.5     3         23%      21%      56%           562
U + file_TEMP        1.8     2         35%      61%      3.9%          198
U                    3.5     2         31%      50%      19%           562
no cut (table 17)    7.5     1         48%      33%      19%           26800

Table 39: Number of user opens/user file


6.3.3.  Per  File  Results  for  User  Files 

User temp files were generally accessed twice. User log and perm files saw somewhat more activity (figure
31 and table 39). Although only 19% of the user files seen were referenced more than twice during the
week of logging, these files accounted for 63% of the opens. User file distributions don't show the frantic
activity to a few files that we saw for the overall distribution, but there was still a small group of relatively
active files that accounted for the majority of the opens.
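
The skew described above (a minority of files receiving the majority of opens) can be computed directly from a list of per-file open counts. The sketch below is only an illustration: the function name share_of_busy_files and the sample counts are hypothetical and are not taken from the Seneca trace.

```python
def share_of_busy_files(opens_per_file, threshold=2):
    """Return (fraction of files opened more than `threshold` times,
    fraction of all opens that went to those files)."""
    busy = [n for n in opens_per_file if n > threshold]
    file_share = len(busy) / len(opens_per_file)
    open_share = sum(busy) / sum(opens_per_file)
    return file_share, open_share


# Hypothetical counts: eight lightly used files and two busy ones.
counts = [1, 1, 1, 2, 2, 2, 2, 2, 9, 18]
file_share, open_share = share_of_busy_files(counts)
print(file_share, open_share)
```

With these made-up counts, 20% of the files account for 67.5% of the opens, a pattern of the same shape as the 19% and 63% reported for user files.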


time since last open (seconds)

Figure 32: File interopen intervals (cumulative, U cut)


distribution         min     max      mean     median   std deviation
U + file_LOG         0.02    5.4e5    31100    3100     5.7e5
U + file_PERM        0       5.4e5    21400    450      4.1e4
U + file_TEMP        0.01    4.2e5    1390     0.38     1.2e4
U                    0       5.4e5    16900    120      3.7e4
no cut (table 20)    0       5.4e5    7502     60       2.2e4

Table 40: User file interopen intervals (seconds)


Interopen intervals for user files (figure 32 and table 40) bore little resemblance to the results we saw for
the overall data. Intervals for user files could generally be expressed in minutes instead of seconds. Temp
files were an exception here. The second open to a temp file usually followed immediately after the first
one.
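
An interopen interval is the time between one open of a file and the next open of the same file. A minimal sketch of how such intervals can be extracted from a time-ordered list of open events follows. The event format (timestamp, file identifier) and the sample trace are assumptions made for illustration, not the log format used in this study.

```python
from statistics import median


def interopen_intervals(events):
    """events: iterable of (timestamp_seconds, file_id), sorted by timestamp.
    Returns the interval between each open and the previous open of that file."""
    last_open = {}
    intervals = []
    for t, file_id in events:
        if file_id in last_open:
            intervals.append(t - last_open[file_id])
        last_open[file_id] = t
    return intervals


# Hypothetical trace: a temp file reopened almost at once, a perm file much later.
trace = [(0, "perm_a"), (1, "tmp_1"), (3, "tmp_1"), (60, "perm_a"), (400, "perm_a")]
intervals = interopen_intervals(trace)
print(intervals, median(intervals))
```

The first open of each file contributes no interval, so files opened only once do not appear in the distribution at all.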


File and data lifetimes for user files are shown in figures 33 and 34. Most user perm and log files had lives
exceeding our logging period. Data in user log and perm files was also long lived (this was not the case for
the overall data). Half of all user temp files lived less than 15 seconds.


Tables  41  and  42  provide  some  statistics  on  user  sharing  of  user  files.  Sharing  was  restricted  to  log  and 
perm  files.  The  low  mean  number  of  inversions  (0.069)  indicates  that  sharing  was  incidental  to  the  normal 
use  of  user  files. 


                     readers           writers           users (r | w)     inversions
cut                  mean     >1       mean     >1       mean     >1       mean     max
U + file_LOG                  3.0%     0.99     3.0%     1.099    5.9%     0.356    9
U + file_PERM                 2.7%     0.632    2.1%     1.12     4.9%     0.231    163
U + file_TEMP        0.798    0.02%    0.995    0%       1.0      0.04%             2
U                    0.838    0.80%    0.889    0.63%    1.035    1.5%     0.069    163
no cut (table 21)    0.792    6.6%     0.930    2.4%     1.30     11.8%

Table 41: User file sharing


          readers            writers            users (r | w)      inversions
number    count    cum       count    cum       count    cum       count    cum
0         5367               3826     12.8%     -        -         29452    98.5%
1         24284              25877    99.37%    29452    98.5%     198
2                  99.69%    98       99.70%    255      99.39%    109      99.56%
3                  99.84%    28       99.80%    74       99.64%    27       99.65%
4                  99.91%    19       99.86%    40       99.77%    23       99.73%
5         11       99.95%    8        99.89%    19       99.83%    14       99.78%
6                  99.95%    7        99.91%    9        99.86%    7
7                  99.96%    7        99.93%    10       99.90%    5        99.82%
8                  99.97%    3        99.94%    5        99.91%    7        99.84%
9                  99.98%    5        99.96%    7        99.94%    7        99.86%
10                 99.98%    2        99.97%    2        99.94%    4        99.88%
>10                          10       100%      17       100%      37
total     29890    -         29890    -         29890    -         29890    -

Table 42: Readers, writers, users and inversions: user references to user files


7.  Implications  for  DFS’s 


In this section we make some observations on DFS design, based on the results we have presented. It
should be emphasized that these observations and suggestions are most applicable to systems that see
reference patterns similar to ours. They will not necessarily carry over to other environments.

The  small  median  size  of  opened  files  (710  bytes)  suggests  that  the  overhead  to  traverse  a  directory  and 
then  actually  open  a  file  will  tend  to  dominate  file  access  time.  Careful  directory  design  and  low 
communication  requirements  for  opens  will  be  needed  to  minimize  this  overhead. 
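
A back-of-the-envelope model makes this argument concrete. The sketch below assumes access time is a fixed per-open overhead plus size divided by transfer rate. The 20 ms overhead and 1 MB/second rate are hypothetical placeholders, not measurements from this study; only the 710 byte median file size comes from our data.

```python
def overhead_fraction(size_bytes, open_overhead_s, bytes_per_second):
    """Fraction of total access time spent on name lookup and open
    rather than on moving the file's data."""
    transfer_s = size_bytes / bytes_per_second
    return open_overhead_s / (open_overhead_s + transfer_s)


# Hypothetical costs: 20 ms to look up and open a file, 1 MB/second transfer rate.
OPEN_OVERHEAD_S = 0.020
BYTES_PER_SECOND = 1_000_000

median_file = overhead_fraction(710, OPEN_OVERHEAD_S, BYTES_PER_SECOND)
large_file = overhead_fraction(200_000, OPEN_OVERHEAD_S, BYTES_PER_SECOND)
print(f"{median_file:.2f} {large_file:.2f}")
```

Under these assumed costs the open overhead is well over 90% of the access time for a median-sized file but only about a tenth of it for a 200,000 byte file, which is why reducing the cost of opens matters more than raw bandwidth for this workload.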

The high percentage of a file that was read or written tells us that migrating a file as a whole is usually
appropriate. Log files are an exception*. In this case, information about the intended use of the file would
be helpful.

For files that were not completely read or written, the fraction accessed depended strongly on the access
mode (read-only, write-only or read/write), the size of the file and the opener of the file. Systems limited
by bandwidth considerations may benefit from using this information in making migration decisions.

Reads accounted for 84% of the bytes transferred in the system. Many of these reads were from large
administrative files that were frequently read and rarely written. Replication and caching of even a few such
files could substantially increase the performance of a DFS.

Most temp files in our environment were opened only once or twice. These files were also short lived,
generally existing for only a few seconds. Many other files were only used a few times during the logging
period. Knowing the intended use of these files at the time of their creation could substantially increase
the performance of a DFS. There is, for example, no need to replicate a temp file, and files that are
infrequently used or short-lived will usually benefit from different initial placement decisions.
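
One way a DFS might use such a hint at creation time is sketched below. This is a hypothetical policy written only to illustrate the idea. The names Placement and initial_placement and the widely_read flag are assumptions, and the rules simply restate the observations in this section (temp files need no replication, log files are poor candidates for whole-file migration).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    replicate: bool           # keep copies at more than one site
    migrate_whole_file: bool  # move the entire file to the opener at open time


def initial_placement(file_class, widely_read=False):
    """file_class is 'log', 'perm' or 'temp'; widely_read is a caller-supplied hint."""
    if file_class not in ("log", "perm", "temp"):
        raise ValueError(f"unknown file class: {file_class}")
    # Only heavily read permanent files are treated as replication candidates.
    replicate = file_class == "perm" and widely_read
    # Log files are rarely read or written completely, so they are not moved whole.
    migrate_whole_file = file_class != "log"
    return Placement(replicate, migrate_whole_file)


print(initial_placement("temp"))
print(initial_placement("perm", widely_read=True))
```

A real system would of course need measured thresholds rather than a boolean hint. The point of the sketch is only that the file class alone already separates the three cases.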

The short interopen intervals seen here (median of 60 seconds) suggest that fast response to changing
patterns is important. DFS's that migrate or replicate a file at open time are often doing the right thing.
User files had substantially longer interopen intervals. In some situations, fast response time will be less
important for these files.

The bursty nature of requests (for our background activity in particular) means that congestion could be a
serious problem at times. Preliminary results from the VICE/Andrew system [Svobodova 85] confirm the
importance of this issue. It will be interesting to investigate algorithms that place and migrate files to
minimize congestion.

User  files  accounted  for  only  15%  of  the  open  activity  on  our  system.  For  DFS's  that  support  access  to  local 
file  systems  coupled  with  access  to  a  global  user  file  system,  minimizing  the  performance  impact  of  adding 
the  global  file  system  on  local  accesses  is  clearly  important. 

Net and system files made up the bulk of shared files. Relatively few user files were shared and this sharing
was incidental to their normal use. Overall, only about 12% of the files on the system were shared. This
suggests that there is no need for replication to improve performance for most files (replication for
increasing availability is another issue, though).

*One might prefer to use a different logging mechanism in a distributed environment in any case.

Over half of all execute requests made went to just 13 files. These files occupied 0.46MB of disk space
(0.08% of the total). This suggests that even a very modest amount of caching or other special treatment
for such files will produce significant improvements in system performance.
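
Given per-file execute counts, the smallest set of executables that covers a chosen fraction of all requests can be found by taking files in decreasing order of use. The sketch below illustrates the calculation. The function name hot_set and the counts are hypothetical and are not the figures behind table A-2.

```python
def hot_set(execute_counts, fraction=0.5):
    """Return the most frequently executed files, in decreasing order of use,
    that together cover at least `fraction` of all execute requests."""
    total = sum(execute_counts.values())
    covered = 0
    chosen = []
    ranked = sorted(execute_counts.items(), key=lambda item: item[1], reverse=True)
    for path, count in ranked:
        if covered >= fraction * total:
            break
        chosen.append(path)
        covered += count
    return chosen


# Hypothetical counts for four executables.
counts = {"/bin/sh": 50, "/bin/rm": 30, "/bin/ls": 15, "/bin/cc": 5}
print(hot_set(counts))
print(hot_set(counts, fraction=0.9))
```

Summing the sizes of the files returned gives the cache space needed to serve that fraction of execute requests locally.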

There were generally substantial differences in access patterns for log, permanent and temporary files.
Placement  and  migration  algorithms  will  benefit  from  recognizing  and  ruthlessly  exploiting  these 
differences. 

8.  Further  Work 

The analysis of file system traces can soak up boundless amounts of time and energy. We have tried to stop
at the point where we felt that we had enough information to understand trace driven simulations based on
the data. There is a great deal of further work that could be done. Some possibilities include:

(1)  Studies  of  open  frequency  as  a  function  of  file  age.  Smith  found  that  for  long  term  file 
reference  patterns,  open  frequency  falls  off  as  the  age  of  the  file  increases  [Smith  81].  A  recent 
survey  of  files  on  Seneca  showing  that  66%  of  all  user  files  (user  log  and  perm  files)  hadn't 
been  accessed  in  over  one  month  [Friedberg  85]  suggests  that  this  is  also  true  in  our 
environment  for  at  least  some  classes  of  files. 

(2)  Studies of interopen intervals as a function of file size. Porcar found that smaller files tend to
have shorter interopen intervals [Porcar 82]. We don't expect this to be true for the overall
activity in our system (because of the large heavily used administrative files), but it may be true
for user files.

(3)  Measuring the paging and inode access activity. It would be interesting to see what fraction of
the file system bandwidth is devoted to each of these activities.

(4)  Examining  in  more  detail  the  activity  per  user.  Ousterhout  et  al.  [Ousterhout  85]  have  done 
some  of  this  work. 

(5)  Using  the  trace  data  to  drive  simulations  investigating  file  system  performance  issues.  A  trace 
driven  simulation  of  Roe  [Ellis  83]  is  planned. 

(6)  Fitting curves to various distributions (size, inter-open time and so on). These would be useful
in writing synthetic drivers for use in simulating DFS's [Satyanarayanan 83].

(7)  Further  data  collection  and  analysis  for  different  environments  and  work  loads.  This  would 
give  us  a  better  feeling  for  where  our  data  fits  into  the  universe  of  file  system  usage. 
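
As an illustration of item (6), a synthetic driver can draw values from a measured cumulative distribution even before any curve has been fitted. The sketch below samples from an empirical distribution by inverting its cumulative fractions. It is not a fitted model, the function name make_sampler is hypothetical, and the value 3 stands in for "more than twice" in the U row of table 39.

```python
import bisect
import random


def make_sampler(values, cumulative_fractions, rng):
    """Return a function that draws from `values` so that values[i] is chosen
    with probability cumulative_fractions[i] - cumulative_fractions[i - 1]."""
    def sample():
        u = rng.random()                                  # uniform in [0, 1)
        i = bisect.bisect_left(cumulative_fractions, u)   # first bin whose cum >= u
        return values[min(i, len(values) - 1)]
    return sample


# Opens per user file, U row of table 39: 31% once, 50% twice, 19% more than twice.
values = [1, 2, 3]
cumulative = [0.31, 0.81, 1.00]
draw = make_sampler(values, cumulative, random.Random(1985))
sample = [draw() for _ in range(10000)]
print(sample.count(1) / len(sample))
```

A fitted curve would replace the table lookup with a closed-form inverse, but the driver built around the sampler would be unchanged.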

9.  Summary 

This  paper  has  described  in  detail  the  collection  and  analysis  of  short  term  file  reference  data  from  a 
4.2BSD  UNIX  system  supporting  university  research.  Our  major  findings: 

(1)  Opened  files  in  our  environment  are  small,  with  half  being  under  710  bytes  long. 

(2)  The majority of bytes read come from larger files (greater than 20,000 bytes long).

(3)  68% of files opened with read access are completely read and 78% of files opened with write
access are completely written. The percentage read and written depends strongly on the class of
the file (log, perm or temp), the mode of open, the file opener and the size of the file. In
particular, log files are almost never completely written and users completely read 94% of files
they open read-only.

(4)  Temporary  files  are  usually  accessed  only  once  or  twice  and  most  live  for  less  than  a  minute. 
Log  and  permanent  files  live  for  much  longer  periods  and  see  more  open  activity. 

(5)  Most  opens  go  to  files  opened  hundreds  or  thousands  of  times  a  week.  Large  administrative 
files  account  for  a  substantial  fraction  of  this  activity. 

(6)  Files  are  generally  open  for  only  a  few  tenths  of  a  second. 

(7)  Interopen intervals in our environment are short. Half are under 60 seconds. The interopen
interval depends strongly on the class (log, permanent or temporary) and owner of the file.

(8)  Most sharing is restricted to system and net files in our environment. Sharing of user files is
incidental to their normal use.

(9)  Executed files are relatively large (half are over 20,000 bytes), heavily used and few in number.

(10)  Half of all execute requests go to a very small number of executable files (13 files: 2.2% of the
referenced executables).

(11) We  see  substantial  differences  in  file  access  patterns  based  on  the  class  of  the  file,  the  owner  of 
the  file  and  the  class  of  the  file  opener.  In  particular,  overall  reference  patterns  do  not  match 
user  file  reference  patterns  and  reference  patterns  for  logs,  permanent  files  and  temporary  files 
bear  little  resemblance  to  each  other. 

These results have a number of interesting implications for DFS design. These implications are discussed
in  section  7. 

As  is  true  with  all  studies  of  this  sort,  our  results  can  be  guaranteed  to  be  valid  only  for  our  system  at  the 
time  of  data  collection.  Care  should  be  taken  in  applying  the  results  to  other  situations. 

Appendix A. Frequently Opened and Executed Files

The following two tables list the most frequently opened and executed files during the logging period.
What are actually listed here are the most frequently accessed inodes, with the given name being the path
used to first access the inode. For the most part, this distinction doesn't matter. There are a few inodes
listed here, though, that are one of several versions of a heavily used system file. An example is
/etc/passwd, which starts life as /etc/ptmp. Occurrences of this are noted in the table.


opens    fraction    path of first open

/etc/hosts
/usr/spool/rwho/whod.keuka
[2 more rwho daemon files]
/etc/passwd [35485 (7.1%) with /etc/ptmp versions]
/etc/utmp
/usr/spool/rwho/whod.capella
[9 more rwho daemon files]
/usr/include/whoami.h
/etc/ptmp [version of /etc/passwd]
/etc/ptmp [version of /etc/passwd]
/usr/lib/sendmail.st
/vmunix
/etc/termcap
/etc/group [6947 (1.4%) for all versions]
/etc/services
/etc/gettytab
/etc/ttys
/usr/lib/uucp/L.sys
/usr/lib/news/sys
/usr/lib/uucp/L.aliases
/usr/adm/lastlog
/etc/hosts.equiv
/usr/lib/news/nactive [9830 (2.0%) for all versions]
/bin/true
/usr/lib/uucp/SEQF
//.cshrc
/usr/lib/news/nactive
/usr/lib/aliases.dir
/usr/lib/aliases.pag
/usr/lib/news/nactive
/usr/lib/sendmail.cf


Table A-1: Frequently Opened Inodes


executes    fraction    path of first execute

fraction

0.70%
0.65%
0.65%
0.63%
0.63%
0.62%
0.60%
0.59%

path of first execute

/bin/sh
/bin/rm
/bin/[
/bin/csh
/bin/ls
/etc/rdist
/usr/ucb/more
/usr/ucb/vi
/bin/login
/bin/echo
/usr/bin/rnews
/usr/lib/sendmail
/etc/Iogld
/bin/hostname
/bin/rmdir
/usr/ucb/mail
/usr/lib/uucp/uuxqt
/usr/lib/uucp/uucico
/bin/stty
/usr/ucb/tset
/bin/mail
/usr/bin/uux
/bin/cat
/bin/eflpsend
/etc/getty
/etc/dmesg
/usr/lib/news/batch
/usr/ucb/clear
/bin/mkdir
/bin/awk
/usr/bin/uux
/etc/getty
/bin/rmail
/lib/cpp
/usr/bin/basename
/usr/ucb/uptime
/bin/cc
/bin/date
/usr/lib/news/compress
/bin/as


Table A-2: Frequently Executed Inodes


Appendix  B.  Selected  Histograms  and  Distributions,  in  Detail 


The  first  two  tables  in  this  appendix  (table  B-1  and  table  B-2)  give  information  on  opens  and  bytes 
transferred  for  each  of  the  14  cuts  described  in  section  4.  Some  of  the  information  in  these  tables  appeared 
in  the  main  body  of  the  paper  and  is  included  again  here  for  comparison  purposes. 

The  remainder  of  the  appendix  gives  a  more  complete  set  of  distributions  for  file  sizes,  percent  read  and 
written,  open  counts,  open  time,  interopen  intervals  and  lifetimes.  Distributions  for  all  of  our  cuts  are 
given.  Again,  some  of  this  information  also  appears  in  the  body  of  the  paper. 


cut              opens     % opens   files     % files   opens/file
ruid_NET         249825    33.1%               51%       4.8
ruid_SYSTEM      298186    39.5%               15%       19.2
ruid_USER        206274    27.3%               45%       4.5
owner_NET        249733    33.1%     46207     45.7%     5.4
owner_SYSTEM     392790    52.1%     25062     24.8%     15.7
owner_USER       111762    14.8%     30822     30.5%     3.6
file_LOG         35662     4.7%      506       0.5%      70.5
file_PERM        499193    66.2%     16352     16.2%     30.5
file_TEMP        219430    29.1%     84327     83.3%     2.6
U + file_LOG     837       0.1%      101       0.1%      8.3
U + file_PERM    65051     8.6%      8662      8.6%      7.5
U + file_TEMP    38420     5.1%      21127     20.9%     1.8
U                104308    13.8%     29890     29.5%     3.5
no cut           754285    100%      101185    100%      7.5

                 reads                 writes                overall (r+w)
cut              bytes/sec  fraction   bytes/sec  fraction   bytes/sec  fraction
ruid_NET         870        21%        250        31%                   22.5%
ruid_SYSTEM      1060       25%        110        14%                   23.5%
ruid_USER        2260       54%        440        55%                   54%
owner_NET        845        20%        245        31%                   22%
owner_SYSTEM     2330       56%        130        16%                   49%
owner_USER       1015       24%        425        54%        1440       29%
file_LOG         45         1.1%       11         1.4%       56         1.1%
file_PERM        3225       77%        285        35%        3510       70%
file_TEMP        920        22%        505        63%        1425       29%
U + file_LOG     9.6        0.2%                  0.6%       14.2       0.3%
U + file_PERM    400        10.0%                 15%        520        10%
U + file_TEMP    570        14%                   37%        870        17%
U                980        23%                   53%        1400       28%
no cut           4190       100%                  100%       4990       100%

Table B-2: Bytes read/written for regular files


B.2.  Dynamic  File  Sizes 


distribution      min    max    mean    median    std deviation
ruid_NET
ruid_SYSTEM
ruid_USER
owner_NET
owner_SYSTEM
owner_USER
file_LOG
file_PERM
file_TEMP
U + file_LOG
U + file_PERM
U + file_TEMP
U
no cut
static


file size (bytes)


Figure B-1: Dynamic file size distributions (cumulative, measured at close, ruid cut)


Figure B-2: Dynamic file size distributions (cumulative, measured at close, owner cut)

Figure B-6: Percent of file read for read-only opens (cumulative, owner cut)

Figure B-8: Percent of file read for read-only opens (cumulative, U cut)


B.4.  Percentage  Written  (Write-Only  Opens) 


distribution      median   std dev   <100%    >100%
ruid_NET          100      48        41%
ruid_SYSTEM       100      25        7.1%
ruid_USER         100      119       12%
owner_NET         100      48        42%
owner_SYSTEM      100      24        6.3%
owner_USER        100      132       11%
file_LOG          <1       12        98.8%
file_PERM         100      18        3.9%
file_TEMP         100      85        0.7%
U + file_LOG      1.9      26        94%
U + file_PERM     100      27        14%
U + file_TEMP     100      201       2.0%
U                 100      134       11%
no cut            100      53        15%

Table B-5: Percentage written (write-only opens)

percent written

Figure B-9: Percent of file written for write-only opens (cumulative, ruid cut)

Figure B-10: Percent of file written for write-only opens (cumulative, owner cut)


B.5.  Percentage Read (Read-Write Opens)


distribution      min    median
ruid_NET          0      100
ruid_SYSTEM       0      100
ruid_USER         0      100
owner_NET         0      38
owner_SYSTEM      0      100
owner_USER        0      100
file_LOG          0      100
file_PERM         0      100
file_TEMP         0      100
U + file_LOG
U + file_PERM
U + file_TEMP
U
no cut


std dev

104
80
1100


<100%

49%
30%
33%


>100%


65500

22
19

65500


Table B-6: Percentage read (read/write opens)


percent read

Figure B-13: Percent of file read for read/write opens (cumulative, ruid cut)

Figure B-17: Percent of file written for read/write opens (cumulative, ruid cut)

Figure B-18: Percent of file written for read/write opens (cumulative, owner cut)

Figure B-19: Percent of file written for read/write opens (cumulative, file cut)

Figure B-20: Percent of file written for read/write opens (cumulative, U cut)

Figure B-22: Number of opens per active file (cumulative, owner cut)

Figure B-24: Number of opens per active file (cumulative, U cut)


B.8.  Time  from  File  Open  to  Close 


Table B-9: Open time (seconds)


B.9.  File  Interopen  Intervals 


distribution      min     max      std deviation
ruid_NET          0       4.0e5
ruid_SYSTEM       0       3.9e5
ruid_USER         0       5.4e5
owner_NET         0       4.0e5
owner_SYSTEM      0       5.3e5
owner_USER        0       5.4e5
file_LOG          0       5.4e5
file_PERM         0       5.4e5
file_TEMP         0       4.2e5
U + file_LOG      0.02    5.4e5
U + file_PERM     0       5.4e5
U + file_TEMP     0.01    4.2e5
U                 0       5.4e5
no cut            0       5.4e5

Table B-10: File interopen intervals (seconds)


B.10.  File Lifetimes

Figure B-33: File lifetimes (cumulative, files living beyond log period binned at right, ruid cut)

Figure B-36: File lifetimes (cumulative, files living beyond log period binned at right, U cut)


B.11.  File Version Lifetimes


Figure B-37: Version lifetimes (cumulative, versions living beyond log period binned at right, ruid cut)


version  lifetime  (seconds) 


Figure  B-38:  Version  lifetimes  (cumulative,  versions  living  beyond  log  period  binned  at  right,  owner  cut) 


Figure  B-39:  Version  lifetimes  (cumulative,  versions  living  beyond  log  period  binned  at  right,  file  cut) 